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ABSTRACT 

The  last  decade  witnessed  a  manifest  shift  in  the  micropro¬ 
cessor  industry  towards  chip  designs  that  promote  parallel  com¬ 
puting.  Until  recently  the  privilege  of  a  select  group  of  large 
research  centers,  Teraflop  computing  is  becoming  a  commod¬ 
ity  owing  to  inexpensive  GPU  cards  and  multi  to  many -core 
x86  processors.  This  paradigm  shift  towards  large  scale  par¬ 
allel  computing  has  been  leveraged  in  Chrono,  a  freely  available 
C+  +  multi-physics  simulation  package.  Chronois  made  up  of  a 
collection  of  loosely  coupled  components  that  facilitate  differ¬ 
ent  aspects  of  multi-physics  modeling,  simulation,  and  visualiza¬ 
tion.  This  contribution  provides  an  overview  of  Chrono: : Engine, 
Chrono:: Flex,  Chrono:: Fluid,  and  Chrono:: Render,  which  are 
modules  that  can  capitalize  on  the  processing  power  of  hun¬ 
dreds  of  parallel  processors.  Problems  that  can  be  tackled  in 
Chronoinclude  but  are  not  limited  to  granular  material  dynam¬ 
ics,  tangled  large  flexible  structures  with  self  contact,  particu¬ 
late  flows,  and  tracked  vehicle  mobility.  The  paper  presents  an 
overview  of  each  of  these  modules  and  illustrates  through  several 
examples  the  potential  of  this  multi-physics  library. 
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INTRODUCTION 

Over  the  last  decade  there  has  been  a  manifest  trend  in  the 
hardware  industry  to  increase  flop/s  rates  by  increasing  the  num¬ 
ber  of  cores  available  on  a  processor.  To  a  very  large  extent,  the 
tide  that  has  risen  sequential  computing  for  several  decades  is 
subsiding.  The  frequency  at  which  cores  are  operated  today  has 
at  best  plateaued;  in  many  cases,  it  went  down  in  an  attempt  to 
tame  power  dissipation  and  overheating.  Instruction  level  paral¬ 
lelism  advances  that  ensured  respectable  gains  through  pipelin¬ 
ing  and  out  of  order  execution  have  largely  fulfilled  their  poten¬ 
tial.  The  bright  spot  in  this  evolving  hardware  landscape  has 
been  the  growing  impetus  behind  parallel  computing  hardware. 
If  anything  has  held  steady  over  the  last  four  decades,  it  has  been 
the  pace  at  which  transistors  are  packed  per  unit  area  in  computer 
chips.  This  trend  allows  today  chip  designs  that  draw  on  22  nm 
feature  length.  Intel’s  road  map  calls  for  14  nm  technology  in 
2014,  10  nm  in  2016,  7  nm  in  2018,  and  5  nm  in  2020.  In  other 
words,  the  number  of  transistors  per  unit  area  will  continue  to 
double  every  two  years  for  the  current  decade.  This  will  trans¬ 
late  into  immediate  access  to  commodity  chips  that  host  multiple 
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compute  cores.  Given  the  stagnation  in  processor  operating  fre¬ 
quency,  an  ever  growing  gap  between  CPU  speed  and  memory 
speed,  and  the  waning  of  instruction  level  parallelism  gains,  it 
becomes  apparent  that  the  only  way  we  can  continue  to  enjoy  re¬ 
duced  simulation  times  or  ability  to  rely  on  refined  models  is  to 
fall  back  on  parallel  computing.  There  are  two  major  directions 
in  which  parallel  computing  has  evolved.  The  x86  architecture 
has  defined  a  solution  that  evolved  as  a  steady  and  predictable 
process  in  which  the  number  of  cores  on  a  chip  increased  over 
time:  AMD  produces  today  16  core  chips,  while  Intel  has  12  core 
processors.  Leveraging  these  chips  requires  a  low  entry  point  that 
calls  for  programming  against  relatively  mature  libraries  such  as 
OpenMP,  MPI,  pthreads,  cilk,  TBB,  etc.  At  memory  bandwidths 
of  75  GB/s  and  flop/s  rates  of  0.3  TFlop/s,  this  has  tradition¬ 
ally  represented  the  conservative  choice  for  entering  the  parallel 
computing  arena.  With  the  release  of  CUDA  1 .0  in  2006,  Nvidia 
offered  a  second  alternative  to  leveraging  parallel  computing  by 
programming  the  ubiquitous  video  cards  available  on  millions 
of  desktops  worldwide.  This  path  to  parallel  computing  is  less 
conventional  as  it  requires  one  to  get  familiar  with  the  hardware 
layout  and  memory  hierarchy  associated  with  GPUs.  Today,  an 
Nvidia  GPU  has  close  to  seven  billion  transistors.  Priced  at  about 
$6000,  an  Nvidia  Kepler  K20x  delivers  a  memory  bandwidth  of 
250  GB/s  and  1.3  TFlop/s  by  virtue  of  using  more  than  2800 
Scalar  Processors.  It  is  used  side  by  side  with  a  regular  CPU  pro¬ 
cessor,  which  means  that  heterogeneous  computing,  on  the  CPU 
and  GPU,  can  lead  to  substantial  speed  gains.  In  this  framework, 
the  GPU  plays  the  role  of  an  accelerator  by  boosting  the  floating 
point  performance  of  the  CPU.  A  similar  setup  is  offered  now  by 
Intel;  i.e.,  CPU  plus  accelerator,  owing  to  its  recent  release  of  the 
Knights  Corner  architecture.  A  Knights  Corner  chip  has  about 
60  cores,  can  deliver  up  to  320  GB/s  and  1  TFlop/s,  and  uses  the 
x86  instruction  set  architecture,  which  translates  into  an  easier 
adoption  path  provided  one  is  familiar  with  OpenMP  or  MPI. 

It  becomes  apparent  that  in  the  immediate  future,  any  in¬ 
crease  in  simulation  speed  or  model  complexity  in  Computa¬ 
tional  Science  will  be  fueled  by  parallel  computing.  This  paper 
outlines  an  ongoing  effort  in  the  area  of  computational  mutli- 
body  dynamics  that  is  motivated  by  this  belief.  It  starts  with 
a  description  of  a  core  simulation  engine  that  aims  at  simula¬ 
tion  of  many-body  dynamics  problems  with  friction  and  contact. 
Chrono:: Engine  handles  both  rigid  and  flexible  bodies  and  draws 
on  MPI  and/or  GPU  computing.  It  then  discusses  Chrono:: Fluid, 
a  GPU  parallel  simulation  tool  that  aims  at  fluid- solid  interaction 
problems,  which  is  singled  out  as  an  application  area  that  has 
been  largely  ignored  until  recently  due  to  an  excessive  computa¬ 
tional  burden  incurred  by  the  simulation  of  systems  of  practical 
relevance.  Finally,  the  papers  outlines  a  rendering  pipeline  that  is 
used  for  postprocessing  of  big  data.  Chrono:: Render  is  capable 
of  using  320  cores  and  is  built  around  Pixar’s  RenderMan.  All 
these  components  combine  to  produce  Chrono,  a  multi-physics 
simulation  environment  that  is  designed  to  take  advantage  of 


commodity  parallel  computing  made  available  by  many-core  and 
GPU  architectures. 


Chrono::Engine 

The  Chrono:: Engine  software  is  a  general-purpose  simula¬ 
tor  for  three  dimensional  multi-body  problems  [1].  Specifically, 
the  code  is  designed  to  support  the  simulation  of  very  large  sys¬ 
tems  such  as  those  encountered  in  granular  dynamics,  where  the 
number  of  interacting  elements  can  be  in  the  millions.  Target 
applications  include  tracked  vehicles  operating  on  granular  ter¬ 
rain  [2]  or  the  Mars  Rover  operating  on  discrete  granular  soil. 
In  these  applications,  it  is  desirable  to  model  the  granular  ter¬ 
rain  as  a  collection  of  many  thousands  or  millions  of  discrete 
bodies  interacting  through  contact,  impact,  and  friction.  Note 
that  such  systems  also  include  mechanisms  composed  of  rigid 
bodies  and  mechanical  joints.  These  challenges  require  an  ef¬ 
ficient  and  robust  simulation  tool,  which  has  been  developed  in 
the  Chronosimulation  package.  Chrono:: Engine  was  initially  de¬ 
veloped  leveraging  the  Differential  Variational  Inequality  (DVI) 
formulation  as  an  efficient  method  to  deal  with  problems  that 
encompass  many  frictional  contacts  -  a  typical  bottleneck  for 
other  types  of  formulations  [3,4].  This  approach  enforces  non¬ 
penetration  between  rigid  bodies  through  constraints,  leading  to 
a  cone-constrained  quadratic  optimization  problem  which  must 
be  solved  at  each  time  step  [5].  Chrono:: Engine  has  since  been 
extended  to  support  the  Discrete  Element  Method  (DEM)  for¬ 
mulation  for  handling  the  frictional  contacts  present  in  granular 
dynamics  problems  [6,7].  This  formulation  computes  contact 
forces  by  penalizing  small  interpenetrations  of  colliding  rigid 
bodies.  Various  contact  force  models  can  be  used  depending  on 
the  application  [8,9]. 

The  remainder  of  this  section  describes  the  features  of 
Chrono:: Engine,  starting  with  the  structure  of  the  code.  Next, 
several  sub- sections  describe  the  use  of  GPU  computing  in  the 
collision  detection  task,  the  use  of  MPI  for  distributed  solution 
of  large  systems,  and  validation  work  which  has  been  done  to 
assess  the  accuracy  of  the  simulation  tool. 

Code  Structure  of  Chrono::Engine 

The  core  of  Chrono:: Engine  is  built  around  the  concept  of 
middleware,  namely  a  layer  of  classes  and  functions  that  can 
be  used  by  third-party  developers  to  create  complex  mechanical 
simulation  software  with  little  effort  [10].  Because  of  this,  graph¬ 
ical  user  interfaces  and  end-user  tools  are  not  the  main  focus  of 
the  CHRONO:: Engine  core  project;  it  is  assumed  that  programs 
with  graphical  interfaces  are  built  on  top  of  such  middleware,  or 
should  be  considered  as  additional,  or  optional,  modules. 

Given  the  complexity  of  the  project,  approaching  half  a 
million  lines  of  code,  the  software  is  organized  in  classes  and 
namespaces  as  recommended  by  the  Object  Oriented  Program- 
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ming  paradigm,  targeting  modularity,  encapsulation,  reusability 
and  polymorphism.  The  libraries  of  Chrono:: Engine  are  thread 
safe,  fully  re-entrant,  and  include  more  than  six  hundred  C++ 
classes.  Objects  from  these  classes  can  be  instanced  and  used  to 
define  models  and  simulations  that  run  in  third  party  software, 
for  instance  vehicle  simulators,  CAD  tools,  virtual  reality  appli¬ 
cations,  or  robot  simulators. 

Chrono:: Engine  is  completely  platform-independent,  hence 
libraries  are  available  for  Windows,  Linux  and  Mac  OSx,  for 
both  32  bit  and  64  bit  versions.  Moreover,  we  followed  a  modular 
approach,  splitting  the  libraries  in  modules  that  can  be  dynami¬ 
cally  loaded  only  if  necessary,  thus  minimizing  issues  of  depen¬ 
dency  from  other  libraries  and  reducing  memory  footprint.  For 
instance,  we  developed  libraries  for  MATLAB  interoperability, 
for  real-time  visualization  through  OpenGL,  for  interfacing  with 
post-processing  tools,  etc.  (see  Figure  1). 


and  so  on.  A  set  of  more  than  thirty  mechanical  constraints 
are  part  of  this  class  hierarchy.  Furthermore,  the  architecture 
is  open  to  further  definition  of  new  specialized  classes  for  user- 
customized  parts  and  joints.  An  object  of  ChSystem  class 
stores  a  list  of  all  moving  parts  and  performs  the  simulation. 


chrono::ChObj  chrono::ChFrame<  double  > 


chrono::ChPhysicsltem  chrono::ChFrameMoving<  double  > 


chrono::ChBodyDEMMPI 


^ 

^ 

Example  C++  program  'A' 

Example  C++  program  'B' 

FIGURE  1 .  UML  graph  of  dependencies  between  module  libraries. 

Classes  and  objects  have  been  tested  and  profiled  for  fast 
execution,  in  order  to  achieve  real-time  performance  when  pos¬ 
sible.  Modern  programming  techniques  have  been  adopted, 
like  metaprogramming,  class  templating,  class  factories,  mem¬ 
ory  leak  trackers  and  persistent-transient  data  mapping.  C++ 
operator  overloading  has  been  used  to  provide  a  compact  alge¬ 
bra  to  manage  quaternions,  static  and  moving  coordinate  sys¬ 
tems,  and  OS-agnostic  classes  are  used  for  logging,  stream¬ 
ing/checkpointing  and  exception  handling. 

We  embraced  an  intense  object-oriented  approach,  therefore 
most  C++  objects  that  define  parts  of  the  multi-body  model  are 
inherited  from  a  base  class  called  ChPhys  ics  1 1  em,  which  de¬ 
fines  the  essential  interfaces  for  all  items  that  have  some  de¬ 
grees  of  freedom.  For  example,  specialized  classes  that  in¬ 
herit  the  ChPhys  ics  Item  are  the  ChBody  class,  which  is 
used  for  3D  rigid  bodies  as  shown  in  Fig. 2,  Ch Shaft,  which 
is  used  for  ID  concentrated  parameter  models  of  power  trains, 
ChLinkLockRevolute  that  is  a  joint  between  rigid  bodies, 


FIGURE  2.  Class  ineritance  diagram  for  objects  of  ChBody  type. 


FIGURE  3.  Collaboration  graph  between  classes:  example  for 

ChBody  and  ChSystem. 

Each  ChPhys  ics  Item-inheriting  class  can  encapsulate  a 
variable  number  of  ChLcpVariable  objects  and/or  a  vari¬ 
able  number  of  ChLcpConst  raint  objects,  that  are  fed  to  the 
solver  for  Cone  Complementarity  Problems  (CCP)  at  each  time 
step  of  the  DVI  integration;  this  helps  the  development  of  black¬ 
box  CCP  solvers  that  are  independent  from  the  data  structures  of 
the  physical  layer.  Also,  these  data  structures  represent  the  sparse 
data  for  the  model  description,  which  is  completely  matrix-  and 
vector-free  for  the  sake  of  a  small  memory  footprint  and  fast  lin¬ 
ear  algebra.  Specifically,  tthe  system  matrices  for  mass,  Jaco- 
bians,  etc.  are  never  explicitly  assembled. 

The  objects  of  most  of  the  above  mentioned  classes  are  man¬ 
aged  by  smart  (shared)  pointers  with  automatic  deletion.  This 
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relieves  the  programer  from  the  burden  of  taking  care  of  ob¬ 
ject’s  lifetime,  given  that  the  relationships  between  objects  can 
be  quite  complex  as  illustrated  in  Fig. 3.  A  large  portion  of  the 
C++  classes  are  available  also  as  Python  modules;  this  enables 
the  use  of  most  simulation  features  in  a  scripted  environment. 
Since  novice  users  are  more  comfortable  with  Python  than  with 
C++,  the  Python  interface  proved  to  be  optimal  for  teaching  pur¬ 
poses.  The  Python  interface  was  produced  using  the  SWIG  util¬ 
ity,  a  process  that  automatically  generates  the  code  for  the  Python 
wrapper. 

The  software  architecture  has  been  designed  to  accommo¬ 
date  an  expandable  system  for  handling  assets  (meshes,  textures, 
CAD  models),  with  multiple  paths  from  pre-processing  to  post¬ 
processing.  To  this  end,  we  also  provide  a  C#  add-in  for  a  para¬ 
metric  3D  CAD  package  (SolidWorks)  that  can  be  used  to  export 
models  into  Chrono:: Engine  without  programming  efforts  (see 
Fig-4). 


Co-simulation 

TCP  socket 


MATLAB  U;, 
“SIMI 


FIGURE  4.  Network  of  asset  workflows. 


Collision  Detection  in  Chrono::Engine 

This  section  describes  the  collision  detection  algorithm  de¬ 
signed  and  implemented  for  the  Chrono:: Engine  package.  Recall 
that  problems  of  interest  are  focused  on  granular  dynamics,  such 
as  sand  flowing  inside  an  hourglass,  a  rover  running  over  sandy 
terrain,  an  excavator/frontloader  digging/loading  granular  mate- 
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FIGURE  5.  Example  of  A  ABB  generation  for  3D  cylinder. 


rial,  etc.  In  this  context,  the  collision  detection  task  is  performed 
on  a  rather  small  collection  of  rigid  and/or  deformable  bodies  of 
complex  geometry  (hourglass  wall,  wheel,  track  shoe,  excava¬ 
tor  blade,  dipper),  and  a  very  large  number  of  bodies  (millions 
to  billions)  that  make  up  the  granular  material.  On  this  scale, 
the  collision  detection  task,  particularly  when  dealing  with  the 
granular  material,  fits  perfectly  the  Single  Instruction  Multiple 
Data  (SIMD)  computation  paradigm.  Specifically,  the  same  se¬ 
quence  of  instructions  needs  to  be  applied  to  every  individual 
body  and/or  contact  in  the  granular  material.  Therefore,  a  colli¬ 
sion  detection  algorithm  capable  of  leveraging  the  SIMD  com¬ 
pute  power  of  commodity  Graphics  Processing  Units  (GPUs) 
was  developed  and  implemented  to  remove  collision  detection 
as  the  bottleneck  in  large  granular  dynamics  simulations. 

The  parallel  collision  detection  algorithm  is  separated  into 
two  phases,  broadphase,  and  narrowphase.  The  broadphase  al¬ 
gorithm  quickly  determines  a  list  of  potential  contact  pairs  while 
the  narrowphase  algorithm  determines  actual  contact  informa¬ 
tion.  A  brief  outline  of  the  parallel  collision  detection  algorithm 
is  presented  below,  for  more  details  see  [11, 12, 13]. 

Broad-Phase  Algorithm  The  Broad-Phase  algorithm  is 
used  to  compute  whether  two  bodies  might  be  in  contact  at  a 
given  time.  The  purpose  of  the  broad-phase  algorithm  is  not  to 
find  actual  contact  information,  but  rather  to  determine  if  a  con¬ 
tact  could  potentially  occur  based  on  the  AABBs  of  the  bodies 
involved. 

An  Axis  Aligned  Bounding  Box  (AABB)  is  a  special  case 
of  a  bounding  box  that  is  always  aligned  to  the  global  reference 
frame,  simplifying  collision  detection  as  the  bounding  box  can¬ 
not  rotate.  Because  of  this,  the  volume  enclosed  by  the  bounding 
box  will  always  be  equal  to  or  greater  than  the  volume  of  the 
shape  it  encloses.  AABB  generation  is  simple  and  can  be  easily 
paralellized  on  a  per  object  basis.  See  Fig.  5  for  an  example  of 
AABB  computation  for  a  cylinder  in  3D  space. 

Spatial  Subdivision  Algorithm  A  high-level  overview 
of  the  GPU-based  collision  detection  is  as  follows.  The  collision 
detection  process  starts  by  identifying  the  intersections  between 
AABBs  and  bins  (see  Fig.  6  for  a  visual  representation  of  a  bin). 
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FIGURE  6.  Example  of  3D  space  divided  into  bins. 

The  AABB-bin  pairs  are  subsequently  sorted  by  bin  id.  Next, 
each  bin’s  starting  index  is  determined  so  that  the  bins’  AABBs 
can  be  traversed  sequentially.  All  AABBs  touching  a  bin  are 
subsequently  checked  against  each  other  for  collisions. 

Narrow-Phase  Algorithm  Once  potential  contacts  have 
been  determined  from  the  broad-phase  collision  detection  stage, 
the  Narrow-Phase  algorithm  needs  to  process  each  possible  con¬ 
tact  and  determine  if  it  actually  occurs.  To  this  end  an  algorithm 
capable  of  determining  contacts  between  convex  geometries  was 
implemented  on  the  GPU.  This  algorithm,  called  “XenoCollide” 
[14],  is  based  upon  Minkowski  Portal  Refinement  (MPR)  [15]. 

Using  MPI  for  distributed  Chrono 

Chronohas  been  further  extended  to  allow  the  use  of  CPU 
parallelism  for  certain  problems.  To  efficiently  simulate  large 
systems,  a  domain  decomposition  approach  has  been  developed 
to  allow  the  use  of  many-core  compute  clusters.  In  this  approach, 
we  divide  the  simulation  domain  into  a  number  of  sub-domains 
in  a  lattice  structure.  Each  sub-domain  manages  the  simulation 
of  all  bodies  contained  therein.  Note  that  bodies  may  span  the 
boundary  between  adjacent  sub-domains.  In  this  case,  the  body 
is  considered  shared  and  its  dynamics  may  be  influenced  by  the 
participating  sub-domains.  The  implementation  leverages  the 
MPI  standard  [16]  to  implement  the  necessary  communication 
and  synchronization  between  sub-domains. 

This  approach  enables  the  simulation  of  large  systems  in 
two  ways.  First,  it  relies  on  the  power  of  parallel  computing 
since  one  compute  core  can  be  assigned  to  each  MPI  process 
(and  therefore  to  each  sub-domain).  These  processes  can  exe¬ 
cute  in  parallel,  constrained  only  by  the  required  communication 
and  synchronization.  Second,  it  allows  access  to  the  larger  mem¬ 
ory  pool  available  on  distributed  memory  architectures.  Whereas 
a  single  node  or  GPU  card  may  have  about  6  GB  of  memory, 
a  distributed  memory  cluster  may  have  on  the  order  of  1TB  of 
memory,  enabling  the  simulation  of  vastly  larger  problems. 

Note  that  the  domain  decomposition  approach  currently  uses 


the  discrete  element  method  to  resolve  friction  and  contact  forces 
between  elements  in  the  system.  The  approach  also  supports  con¬ 
straints  between  bodies  in  the  simulation  by  considering  an  as¬ 
sembly  of  constrained  rigid  bodies  as  a  unit  which  must  always 
be  kept  together.  Therefore,  if  any  body  in  a  chain  of  constrained 
bodies  is  contained  in  a  given  sub-domain,  all  bodies  in  the  chain 
are  considered  by  that  sub-domain  and  used  to  correctly  solve 
the  constraint  equations. 

Sub-division  and  Set-up  A  pre-processing  step  is  used 
to  discretize  the  simulation  domain  into  a  specified  number  of 
sub-domains,  set  up  the  communication  conduits  between  pro¬ 
cesses,  and  initialize  the  sub-domains  as  appropriate.  The  sub¬ 
division  is  based  on  a  cubic  lattice  with  support  for  arbitrary 
sized  divisions.  The  sub-domain  boundaries  are  aligned  with 
the  global  cartesian  coordinate  system,  and  their  locations  are 
user-specified.  Separate  MPI  processes  are  mapped  to  each  sub- 
domain.  Note  that  at  this  time,  the  sub-division  is  static  and  does 
not  change  during  the  simulation.  Therefore,  the  user  should  be 
careful  to  set  up  the  discretization  to  maintain  the  best  possible 
load  balancing. 

In  terms  of  communication,  each  sub-domain  in  the  grid  can 
communicate  with  all  other  sub-domains.  These  communication 
pathways  are  set  up  and  initialized  during  the  pre-processing  step 
and  persist  throughout  the  simulation. 

Note  that  this  implementation  relies  heavily  on  inheri¬ 
tance  and  the  class-based  structure  of  Chrono.  For  exam¬ 
ple,  ChSystem  is  extended  to  ChSystemMPI  by  including 
the  code  to  perform  communication  and  synchronize  the  sub- 
domains. 

Simulation  and  Communication  Each  sub-domain  is 
now  represented  by  a  ChSystemMPI  object  and  an  associated 
MPI  process.  For  example,  assume  a  simulation  is  discretized 
into  a  set  S  of  m  sub-domains.  In  this  case,  let  S  =  {A,B,CD} 
and  m  =  4,  and  map  an  MPI  rank  to  each  sub-domain  so  that 
A  is  mapped  to  MPI  rank  0,  B  -A  1,  C  -A  2,  and  D  -A  3.  Each 
sub-domain  maintains  at  all  times  m +  1  lists  of  objects.  The  first 
list  contains  all  objects  which  are  even  partially  contained  in  the 
associated  sub-domain.  These  are  the  objects  which  must  be  con¬ 
sidered  when  computing  contact  forces,  for  example.  The  next 
m  lists  contain  bodies  which  are  shared  with  other  sub-domains. 

In  our  example,  sub-domain  B  maintains  the  lists  Bq ,  Ba , 
Bb,  Be,  and  Bp.  Bo  is  the  list  of  all  objects  that  intersect  (touch) 
sub-domain  B ,  while  Ba  is  the  list  of  objects  which  are  in  sub- 
domain  A  and  B.  Note  that  sub-domain  A  has  a  list  Ab  which 
should  contain  the  same  objects  as  Ba  .  Further,  list  Bb  is  not  used 
but  is  created  for  the  sake  of  generality.  All  lists  are  maintained 
in  order  sorted  by  object  ID  number  (see  Fig. 7). 

The  sub-domains  are  now  ready  for  time- stepping.  Each 
sub-domain  X  performs  collision  detection  among  all  objects  in 
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FIGURE  7.  (TOP)  Sample  2D  simulation  domain  with  four  sub- 
domains  and  seven  objects.  (BOTTOM)  Corresponding  object  lists  for 
each  sub-domain. 


list  Xq  and  computes  the  associated  collision  forces  based  on  the 
DEM  force  model.  Then,  sub-domain  X  computes  the  net  force 
on  each  object  in  list  Xq ,  taking  into  account  the  contact  forces, 
gravitational  forces,  and  applied  forces. 

Next,  mid-step  communication  occurs.  Sub-domain  X 
should  send  to  each  sub-domain  Y  the  net  force  on  each  body 
in  list  Xy.  Similarly,  X  should  receive  from  each  Y  the  net  force 
on  each  body  in  list  Xy.  Finally,  X  should  compute  the  total  force 
on  each  body  in  list  Xo .  Note  that  X  may  receive  force  contribu¬ 
tions  for  a  given  body  from  any  or  all  of  the  other  sub-domains 
in  the  system. 

At  this  point,  each  sub-domain  X  has  the  true  net  force  on 
each  body  in  its  list  Xo .  Each  sub-domain  can  advance  the  state 
of  its  bodies  in  time  by  one  time  step  by  computing  the  new  ac¬ 
celerations,  velocities,  and  positions  of  all  objects  in  the  sub- 
domain  given  their  mass/inertia  properties  and  the  set  of  applied 
forces.  We  perform  an  end-of-step  communication  to  synchro¬ 
nize  object  states  among  sub-domains.  All  sub-domains  which 
share  a  given  body  should  compute  its  new  state  identically, 
but  due  to  the  potential  for  round-off  error  we  synchronize  the 
state  from  the  master  sub-domain  (where  the  center-of-mass  is 
located)  to  all  others.  The  final  stage  is  to  process  the  m  +  1 


lists  in  each  sub-domain,  as  objects  may  enter  or  leave  a  given 
sub-domain  or  be  shared  between  a  different  set  of  sub-domains, 
necessitating  updates  of  the  contents  of  the  lists. 


Example  Simulation  In  this  example  we  simulate  a 
Mars  Rover  type  wheeled  vehicle  operating  on  granular  terrain. 
The  vehicle  is  composed  of  a  chassis  and  six  wheels  connected 
via  revolute  joints.  The  wheels  are  driven  with  a  constant  an¬ 
gular  velocity  of  n  rad/sec.  The  granular  terrain  is  composed 
of  2,016,000  spherical  particles.  The  simulation  is  divided  into 
64  sub-domains  and  uses  a  time  step  of  10-5  sec.  A  snapshot 
from  the  simulation  can  be  seen  in  Fig. 8.  In  the  figure,  note  that 
the  wheels  of  the  rover  are  checkered  blue  and  white.  This  sig¬ 
nifies  that  the  master  copy  of  the  rover  assembly  is  in  the  blue 
sub-domain  and  the  rover  spans  into  adjacent  sub-domains. 


FIGURE  8.  Snapshot  of  Mars  Rover  simulation  with  2,016,000  ter¬ 
rain  particles  using  64  sub-domains.  Bodies  are  colored  by  sub-domain, 
with  shared  bodies  (those  which  span  sub-domain  boundaries)  colored 
white. 


Validation  and  Demonstration  of  Technology 

This  section  describes  a  validation  effort  in  which  experi¬ 
mental  results  were  compared  to  simulation  results  obtained  from 
Chrono:: Engine.  To  this  end,  a  test  rig  was  designed  and  fabri¬ 
cated  to  measure  the  rate  at  which  granular  material  flowed  out  of 
a  slit  due  to  gravity.  Chrono:: Engine  was  used  to  set  up  a  corre¬ 
sponding  simulation  to  match  the  experimental  results.  For  more 
detail,  see  [17]. 
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Experimental  Model  The  experimental  set-up  consisted 
of  a  fixed  base,  a  movable  wall  (angled  at  45°),  a  translational 
stage,  a  linear  actuator,  and  a  scale  (see  schematic  in  Fig.  9).  The 
linear  actuator  was  capable  of  quickly  opening  a  precise  gap,  out 
of  which  the  granular  material  would  flow  due  to  gravity.  The 
scale  recorded  the  mass  of  collected  granular  material  as  a  func¬ 
tion  of  time.  The  granular  material  consisted  of  approximately 
40,000  uniform  glass  disruptor  beads  with  diameter  of  500  mi¬ 
crons.  Experiments  were  performed  for  gap  sizes  of  1.5  mm,  2 
mm,  2.5  mm,  and  3  mm.  At  least  5  experiments  were  performed 
for  each  gap  size. 


Fixed 

wall 


Scale 

FIG  U  R  E  9.  Schematic  of  validation  experiment.  A  linear  actuator  and 
translational  stage  moved  the  left  angled  side  a  fixed  amount,  opening 
a  precise  gap  from  which  the  particles  flowed.  The  mass  flow  rate  was 
measured  by  the  scale.  Schematic  not  to  scale. 


Simulation  Model  Chrono:: Engine  was  used  to  build  a 
model  representing  the  experimental  set  up  described  above.  In 
the  model,  the  trough  was  represented  by  four  rectangular  boxes 
of  finite  dimensions.  The  motion  of  the  box  representing  the 
angled  side  was  captured  from  the  data  sheet  of  the  translational 
stage.  The  granular  material  was  modeled  as  perfect,  identical 
spheres  with  the  same  mass  and  coefficient  of  friction. 

The  load  cell  measured  the  outflow  through  the  gap.  In  the 
simulation,  the  scale  was  modeled  by  counting  the  number  of 
spheres  below  a  certain  height.  The  number  of  spheres  multi¬ 
plied  by  the  mass  and  gravity  yielded  the  weight  which  was  com¬ 
pared  with  experimental  results.  A  plane  was  used  to  contain  the 
spheres  after  they  had  been  counted. 

In  order  to  save  computational  time,  the  simulation  was  split 
into  two  parts:  one  representing  the  process  of  filling  the  trough 
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and  the  other  the  opening  and  measuring  process.  In  this  way,  the 
trough  was  filled  with  randomly  positioned  spheres  which  were 
allowed  to  settle.  Once  the  kinetic  energy  of  the  system  was 
below  0.001  Joules  and  had  reached  a  relatively  constant  value, 
the  x-,  y- ,  and  z-position  of  each  sphere  was  saved  to  a  file. 

The  same  initial  conditions  from  the  settling  simulation  were 
used  to  perform  all  of  the  necessary  simulations.  At  the  begin¬ 
ning  of  each  simulation  the  position  data  set  of  the  spheres  was 
loaded  into  the  model  and  the  spheres  were  created  at  the  same 
positions  they  appeared  in  the  filling  process.  The  motion  was 
applied  to  the  translating  side  to  achieve  the  desired  gap  size, 
and  the  material  began  to  flow. 

The  simulations  setup  consisted  of  39,000  rigid  body 
spheres  with  a  radius  of  2.5  x  10—4  m  and  a  mass  of  1.631  x 
10-7  kg.  The  following  parameters  were  set  for  this  simulation. 
A  time  step  of  10— 4  [s]  with  500  CCP  iterations,  and  a  tolerance 
of  10-7  for  the  maximum  velocity  correction.  Simulations  were 
generally  run  for  8  seconds.  SI  units  were  used  for  all  parame¬ 
ters. 


Procedure  Used  to  Select  the  Friction  Coefficient 

The  friction  coefficient  of  a  certain  material  is  not  a  constant 
value.  It  can  depend  on  various  environmental  influences  such 
as  humidity,  surface  quality,  temperature  etc.  The  friction  coef¬ 
ficient  of  glass  was  an  unknown  in  the  validation  process  and 
needed  to  be  determined  before  further  observations  could  be 
done.  To  achieve  this,  one  experiment  at  a  gap  size  of  1.5  mm 
was  performed  and  multiple  simulations  with  the  same  setup  and 
different  friction  coefficients  were  performed.  The  simulation  re¬ 
sults  were  compared  to  the  experimental  test  results  to  determine 
which  friction  coefficient  resulted  in  the  best  match.  It  was  de¬ 
termined  that  fd  =  0.15  most  closely  matched  the  experimental 
results.  This  value  was  used  for  all  subsequent  simulations. 


Results  The  weight  of  the  collected  granular  material  is 
plotted  versus  time  in  Fig.  10  through  Fig.  13.  The  mass  flow 
rate  is  proportional  to  the  slope.  For  example,  Fig.  10  shows  the 
results  with  a  gap  size  of  3mm  and  5  experimental  runs  (dashed 
lines  represent  experimental  data,  while  the  solid  red  line  rep¬ 
resents  simulation  data).  The  average  slope  of  the  experimental 
runs  for  this  test  was  1.41E-2  [N/s]  and  the  slope  of  the  simula¬ 
tion  was  1 .40E-2  [N/s] . 

The  experimental  flow  rate  is  based  off  of  the  average  of  sev¬ 
eral  experimental  runs  and  the  uncertainty  of  the  rates  is  deter¬ 
mined  by  a  Student’s  T  distribution  with  95%  confidence.  Based 
on  an  error  of  less  than  2%,  the  simulated  results  match  the  ex¬ 
perimental  results  well.  Despite  this,  there  are  several  factors  that 
could  be  improved  upon.  Namely,  the  design  of  the  rig  did  not 
exactly  match  the  simulated  trough.  The  rig  was  machined  from 
aluminum,  whereas  the  trough  in  the  simulation  had  the  same  co¬ 
efficient  of  friction  as  the  granular  material.  Likewise,  the  sim- 
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Time  (sec) 

FIGURE  10.  Weight  vs  time  for  a  gap  size  of  3  mm. 


Mass  Flow  Rate  2.5mm 


FIGURE  1 1 .  Weight  vs  time  for  a  gap  size  of  2.5  mm. 


Mass  Flow  Rate  2.0mm 


FIGURE  12.  Weight  vs  time  for  a  gap  size  of  2  mm. 


ulation  was  conducted  in  essentially  vacuum,  ignoring  any  aero¬ 
dynamic  forces,  while  the  experiment  was  performed  at  ambient 
conditions.  Lastly,  the  effects  of  humidity  were  not  taken  into 
account  by  the  simulation. 


FIGURE  13.  Weight  vs  time  for  a  gap  size  of  1.5  mm.  This  was  the 
test  case  that  was  used  for  calibration. 


Chrono::Flex 

The  Chrono::Flex  software  is  a  general-purpose  simulator 
for  three  dimensional  flexible  multi-body  problems  and  provides 
a  suite  of  flexible  body  support.  The  features  included  in  this 
module  are  multiple  element  types,  the  ability  to  connect  these 
elements  with  a  variety  of  bilateral  constraints,  multiple  solvers, 
and  contact  with  friction.  Additionally,  Chrono::Flex  leverages 
the  GPU  to  accelerate  solution  of  large  problems. 

Element  Types 

Chrono::Flex  includes  two  element  types  implemented  us¬ 
ing  the  Absolute  Nodal  Coordinate  Formulation  (ANCF)  [18, 
19].  The  gradient-deficient  beam  element  and  the  gradient- 
deficient  plate  element  are  described  below. 

Gradient-Deficient  Beam  Elements  This  implemen¬ 
tation  uses  gradient  deficient  ANCF  beam  elements  to  model 
slender  beams,  examples  of  which  are  shown  in  Figure  14.  These 
are  two  node  elements  with  one  position  vector  and  only  one  gra¬ 
dient  vector  used  as  nodal  coordinates.  Each  node  thus  has  6  co¬ 
ordinates:  three  components  of  the  global  position  vector  of  the 
node  and  three  components  of  the  position  vector  gradient  at  the 
node.  This  formulation  displays  no  shear  locking  problems  for 
thin  and  stiff  beams  and  is  computationally  more  efficient  com¬ 
pared  to  the  original  ANCF  due  to  the  reduced  number  of  nodal 
coordinates  [20].  The  gradient  deficient  ANCF  beam  element 
does  not  describe  a  rotation  of  the  beam  about  its  own  axis  so  the 
torsional  effects  cannot  be  modeled. 

Gradient-Deficient  Plate  Elements  Much  like  beams, 
numerical  difficulties  are  encountered  in  the  fully  parameterized 
plate  element  when  the  system  has  very  thin  and  stiff  compo¬ 
nents  [21].  The  high  frequencies  that  are  induced  along  the  thin 
direction  of  the  element  require  an  extremely  small  time  step,  re- 
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FIGURE  14.  Two  models  with  friction  and  contact  using 
Chrono::Flex  beam  elements:  a  ball  sitting  on  grass-like  beams 
and  a  ball  hitting  a  net. 


suiting  in  longer  simulation  times.  In  the  case  where  the  aspect 
ratio  (length  divided  by  thickness)  of  the  element  is  high,  plane 
stress  assumptions  can  be  made  that  allow  a  reduced-  order  ele¬ 
ment  to  be  accurate.  Specifically,  Kirchhoff’s  plate  theory,  which 
does  not  account  for  shear  deformation,  is  used  and  results  in  an 
element  with  36  degrees  of  freedom,  or  nodal  coordinates,  shown 
in  Figure  15. 


§ 

FIGURE  15.  Two  models  with  friction  and  contact  using 
Chrono::Flex  plate  elements:  a  cloth  hanging  on  a  sphere  and  a 
closed  contour  shaped  like  a  tire. 


Kinematic  Constraints 

Several  types  of  mechanical  joints  are  modeled  in 
Chrono::Flex.  A  spherical  joint  [22]  between  two  nodes  of  any 
two  bodies  will  require  the  position  vector  of  each  node  to  be 
identical.  A  revolute  joint  will  have  two  additional  constraints 
to  the  spherical  joint  constraints.  In  this  case,  the  gradient  vec¬ 
tors  of  the  two  nodes  will  remain  in  a  plane  perpendicular  to  the 
axis  of  revolute  joint.  There  are  also  additional  constraints  due  to 
the  element  connectivity  in  each  beam.  The  element  connectiv¬ 
ity  can  be  modeled  as  a  fixed  joint  between  the  nodes.  Here  the 
common  node  between  two  elements  is  treated  as  two  different 
nodes  attached  to  each  other  through  the  fixed  joint.  This  fixed 
joint  requires  all  the  nodal  coordinates  of  the  two  nodes  be  iden¬ 


tical.  The  generalized  coordinates  of  the  system  change  in  time 
under  the  effect  of  applied  forces  such  that  these  constraint  equa¬ 
tions  are  satisfied  at  all  times.  The  time  evolution  of  the  system 
is  governed  by  the  Lagrange  multiplier  form  of  the  constrained 
equations  of  motion. 

Solvers 

The  equations  shown  in  Figure  16  form  a  system  of  index- 3 
Differential  Algebraic  Equations  (DAEs).  Although  several  low 
order  numerical  integration  schemes  have  been  effectively  used 
to  solve  index-3  DAEs,  Chrono::Flex  utilizes  the  Newmark  inte¬ 
gration  scheme  [23].  Originally  used  in  the  structural  dynamics 
community  for  the  numerical  integration  of  a  linear  set  of  second 
order  ODEs,  it  was  adapted  for  the  discretization  of  DAEs.  This 
implicit  solver  was  proved  to  have  convergence  of  order  1  or  2, 
depending  on  the  choice  of  parameters  y  and  /3 . 


Generalized  Accelerations -  _ 

Reaction  Elastic 

Generalized  Mass  Matrix  — i  Force  Force 

_  i — - 1  i - * - 1 

Force  Balance  Equations  |  >  Mq  +  Oq7  (q,/)A,  +  Qint(q)  =  Qext(q,q,/) 


Generalized  Positions 


Holonomic  Kinematic  Constraints  | — ►  ®(q,f)  =  0 


Applied  Force 


FIGURE  16.  The  equations  of  motion  for  Chrono::Flex. 


At  each  time  step,  the  numerical  solution  commences  by 
solving  the  nonlinear  set  of  equations  shown  in  Figure  17.  The 
numerical  solution  of  the  nonlinear  algebraic  system  falls  back 
on  a  Newton-type  iterative  algorithm  that  requires  the  computa¬ 
tion  of  its  sensitivity  matrix.  Advancing  the  numerical  solution  in 
time  draws  on  three  loops:  the  outer-most  loop  marches  forward 
in  time,  while  at  each  time  step  the  second  loop  solves  the  alge¬ 
braic  discretization  problem  in  Figure  17.  Each  iteration  in  this 
second  loop  launches  a  third  loop  whose  role  is  that  of  producing 
a  vector  of  corrections  for  the  acceleration  and  Lagrange  multi¬ 
pliers.  The  corrections  are  computed  using  the  BiCGStab  itera¬ 
tive  solver  [24],  which  also  provides  for  a  matrix-free  solution. 
A  serial  solver  was  implemented  using  the  Armadillo  Matrix  Al¬ 
gebra  Library  [25]  and  a  GPU  parallel  solver  was  implemented 
using  CUSP  [26],  a  linear  algebra  library  built  on  top  of  CUDA. 
Chrono::Flex  was  validated  in  [27]  as  well  as  in  [28]  against  the 
commercial  code  ADAMS  [29],  and  the  nonlinear  finite  element 
analysis  code  ABAQUS  [30]. 


Chrono::Fluid 

The  Chrono::  Fluid  component  aims  at  leveraging  GPU 
computing  to  efficiently  simulate  fluid  dynamics  and  fluid- solid 
interaction  problems.  Fluid-Solid  Interaction  (FSI)  covers  a 
wide  range  of  applications,  from  blood  and  polymer  flow  to 


9 


Copyright  ©  2013  by  ASME 


UNCLASSIFIED 


rTime  Step  Index 
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I  Z 
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Velocity  Update] — >  q„+1  =  q„  +  /z[(l  -  y)q„  +  /qw+,  ] 

' —  Generalized  Velocities 

Force  Balance  Equations  |  ►  (Mq)w+1  +  (Oq7  X  )  ^  +  (Qint  -  Qext ) +|  =  0 

Reaction  Force  Applied  Force 
Holonomic  Kinematic  Constraints] - >  0(qw+1,4+i) =  0 

FIGURE  17.  The  discretized  equations  of  motion  for  Chrono::Flex 
(fully  implicit). 


tanker  trucks  and  ships.  Simulation  of  the  FSI  problem  requires 
two  components:  Fluid  and  Solid  simulations.  Simulation  of 
the  Solid  phase,  either  rigid  or  flexible,  in  an  HPC  fashion,  is 
described  in  previous  sections.  To  leverage  the  existing  solid 
phase  simulation,  the  fluid  flow  simulation  should  satisfy  some 
conditions,  introduced  by  the  aforementioned  target  problems. 
First,  the  fluid  flow  may  experience  large  domain  deformation 
due  to  the  motion  of  the  solid  phase.  Second,  the  two  phases 
should  be  coupled  via  an  accurate  algorithm.  Third,  target 
problems  may  experience  free  surface  as  well  as  internal  flow. 
Finally,  the  whole  simulation  should  be  capable  of  an  HPC 
implementation  to  maintain  the  scalability  of  the  code. 

Fluid  flow  can  be  simulated  in  either  an  Eulerian  or  a  Lagrangian 
framework.  Provided  that  the  interfacial  forces  are  captured 
thoroughly,  the  Lagrangian  framework  is  capable  of  tracking  the 
domain  deformation  introduced  by  the  motion  of  the  solid  phase 
at  almost  no  extra  cost.  Smoothed  Particle  Hydrodynamics 
(SPH)  [31,32,33],  its  modifications  [34,35],  and  variations  [36] 
have  been  widely  used  for  the  simulation  of  the  fluid  domain  in 
a  Lagrangian  framework.  The  main  evolution  equations  of  the 
fluid  flow  using  SPH  are  expressed  as 


d^=PaJL^r(va-vb).VaWab  (1) 

at  “  Pb 


dVg 

dt 


{l^a  H-  l^b)rab'^ aWab  f 

"■*)  f‘ 
(2) 


which  are  solved  in  conjunction  with 


is  the  representative  fluid  mass  assigned  to  the  SPH  marker,  W 
is  a  kernel  function  which  smooths  out  the  local  fluid  properties 
within  a  resolution  length  /  =  Kh,  and  r ab  is  the  distance  between 
two  fluid  markers  denoted  by  a  and  b.  Lluid  flow  evolution 
equations,  defined  by  Eqs.  (1)  to  (3),  are  solved  explicitly,  where 
pressure  is  related  to  density  via  an  appropriate  state  equation  to 
maintain  the  compressibility  below  1%.  To  increase  the  accuracy 
and  stability  of  the  simulation,  an  XSPH  modification  [34]  and 
Shephard  filtering  [37]  were  applied. 


FSI  with  Smoothed  Particle  Hydrodynamics:  A  Quick 
Overview 

A  proper  choice  of  fluid-solid  coupling  should  satisfy  the 
no- slip  and  impenetrability  conditions  on  the  surface  of  the  solid 
obstacles.  By  attaching  Boundary  Condition  Enforcing  (BCE) 
markers  on  the  surface  of  the  solid  objects,  the  local  relative  ve¬ 
locity,  i.e.  at  the  markers  location,  of  the  two  phases  will  be  zero 
(Fig.  18).  The  position  and  velocity  of  the  BCE  markers  are  up¬ 
dated  according  the  motion  of  the  solid  phase,  which  results  in 
the  propagation  of  the  solid  motion  to  the  fluid  domain.  On  the 
other  hand,  the  interaction  forces  on  the  BCE  markers  are  used 
to  calculate  the  total  force  and  torque  exerted  by  the  fluid  on  the 
solid  object. 


FIGURE  18.  Coupling  of  the  fluid  and  solid  phases.  BCE  and  fluid 
markers  are  represented  by  black  and  white  circles,  respectively. 


FSI  with  Smoothed  Particle  Hydrodynamics:  Prox- 

dx/dt  =  v  (3)  imity  Computation  The  overall  simulation  of  the  FSI  frame¬ 

work  is  performed  in  parallel,  where  each  thread  handles  the 
to  update  the  fluid  properties.  In  Eqs.  (1)  to  (3),  p,  v,  and  force  calculation  of  a  fluid  or  BCE  marker  first,  and  a  rigid  body 

p  are  local  fluid  density,  velocity,  and  pressure,  respectively,  m  later.  Next,  the  parallel  threads  perform  the  kinematics  update  of 
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the  fluid  markers,  rigid  bodies,  and  BCE  markers,  respectively. 
An  essential  part  of  the  force  calculation  stage  is  the  proximity 
computation,  which  will  be  explained  briefly  herein. 

Proximity  computation  used  in  our  work  leverages  the  algorithm 
provided  in  CUDA  SDK  [38],  where  the  computation  domain 
is  divided  into  bins  whose  sizes  are  the  same  as  the  resolution 
length  of  the  SPH  kernel  function.  A  hash  value  is  assigned  to 
each  marker  based  on  its  location  with  respect  to  the  bins.  Mark¬ 
ers  are  sorted  based  on  their  hash  value.  The  sorted  properties  are 
stored  in  independent  arrays  to  improve  the  memory  access  and 
cache  coherency.  To  compute  the  forces  on  a  marker,  the  lists  of 
the  possible  interacting  markers  inside  its  bin  and  all  26  neighbor 
bins  are  called.  The  hash  values  of  the  bins  are  used  to  access  the 
relevant  segments  of  the  sorted  data. 


Chrono::Render 

Chrono:: Render  is  a  software  package  that  enables  simple, 
streamlined,  and  fast  visualization  of  arbitrary  data  using  Pixar’s 
RenderMan  [41].  Specifically,  Chrono:: Render  contains  a  hy¬ 
brid  of  processing  binaries  and  Python  scripting  modules  that 
seek  to  abstract  away  the  complexities  of  rendering  with  Ren¬ 
derMan.  Additionally,  Chrono:: Render  is  targeted  for  providing 
rendering  as  an  automated  post-processing  step  in  a  remote  sim¬ 
ulation  pipeline,  hence  it  is  controlled  via  a  succinct  XML  speci¬ 
fication  for  “gluing”  together  rendering  with  arbitrary  processes. 
As  seen  in  Figure  20,  Chrono:: Render  combines  simulation  data, 
XML  describing  how  to  use  the  data,  and  optional  user-defined 
Python  scripts  into  a  complex,  visually -rich  scene  to  be  rendered 
by  RenderMan. 


Validation  and  Demonstration  of  Technology 

The  aforementioned  FSI  simulation  engine  was  used  to 
validate  the  lateral  migration  of  cylindrical  particles  in  plane 
Poiseuille  flow,  spherical  particles  in  pipe  flow,  and  particle  dis¬ 
tribution  in  Poiseuille  flow  of  suspension  [39,40].  Due  to  the 
scalability  of  Chrono: :Fluid  in  both  fluid  and  solid  phases,  in¬ 
creasing  the  number  of  rigid  bodies,  which  translates  into  de¬ 
creasing  the  number  of  fluid  particles,  does  not  affect  the  simula¬ 
tion  time.  Therefore,  the  simulation  of  a  highly  dense  suspension 
is  possible.  Figure  19  shows  the  result  of  the  simulation  of  the 
flow  of  suspension  including  1,500  particles  through  a  channel. 
A  similar  scenario  with  13,000  particles  in  suspension  was  suc¬ 
cessfully  simulated  in  Chrono:: Fluid. 


FIGURE  19.  Simulation  of  rigid  bodies  inside  a  fluid  flow:  Rigid 
ellipsoids  with  their  BCE  markers  are  shown  in  the  left  image  while  the 
fluid’s  velocity  contours  and  rigid  ellipsoids  at  the  mid- section  of  the 
channel  are  shown  in  the  right  image. 


FIGURE  20.  Chrono:: Render  architecture. 


On  the  Choice  of  RenderMan 

Using  RenderMan  for  rendering  is  motivated  by  the  scope  of 
arbitrary  data  sets  and  the  potentially  immense  scene  complexity 
that  results  from  big  data;  REYES,  the  underlying  architecture 
for  RenderMan  is  ideally  suited  for  this  task.  REYES  works  by 
dividing  each  surface  in  the  scene  into  a  grid  of  micropolygons 
and  shades  at  the  grid  vertices  [42]  (see  Figure  21). 

This  results  in  tractable  rendering  for  complex  scenes  be¬ 
cause:  (a)  only  a  small  portion  of  the  scene  needs  to  be  in  mem¬ 
ory  at  any  given  time;  (b)  grid-based  computation  leads  to  opti¬ 
mal  memory  access  patterns;  ( c )  non- visible  objects  need  not  be 
loaded  into  memory;  ( d )  fully -rendered  objects  can  be  removed 
from  memory;  and  ( e )  objects  are  tessellated  according  to  size  on 
the  screen;  less  complex  geometry  is  dynamically  loaded  when¬ 
ever  possible. 
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FIGURE  21 .  An  overview  of  the  REYES  Pipeline. 


REYES  is  perfectly  suited  for  parallel  processing  since  it 
scales  linearly  with  the  number  of  cores.  Considering  that 
REYES  needs  only  a  handful  of  relevant  scene  elements  at  a 
time,  this  data  can  be  parsed  into  low-memory  buckets  and  dis¬ 
tributed  amongst  cores  for  parallel  rendering;  thus  REYES’  low 
memory-footprint  and  efficient  concurrent  resource  usage  for 
the  complex  scenes  makes  it  a  great  renderer  for  a  distributed- 
computing  platform. 

Accessibility  of  High-Quality  Graphics 

Although  REYES  can  manage  the  issue  of  scene  complex¬ 
ity,  leveraging  this  power  is  difficult  without  computer  graphics 
expertise.  The  guiding  principle  of  Chrono:: Render  is  to  make 
high-quality  rendering  available  to  researchers,  most  of  whom 
don’t  have  the  background  or  bandwidth  to  spend  time  learn¬ 
ing  how  to  use  complex  graphics  applications  or  make  sense 
of  REYES’  intricacies.  Consequently,  Chrono:: Render  encapsu¬ 
lates  into  the  XML  specification  the  complicated  steps  needed  to 
make  interesting  visual  effects,  such  as  multipass  rendering.  The 
user  must  only  instance  the  correct  XML  components  to  achieve 
high-quality  renders.  The  program  flow  of  Chrono:: Render  is 
shown  in  Figure  22. 


FIGURE  22.  Chrono ::Render  execution  workflow. 


The  XML  specification  allows  for  the  concise  expression  of 
salient  features  and  scene  objects.  For  example,  the  snippet  in 
Figure  23  illustrates  the  XML  file  that  translates  a  single  line 


from  a  CSV  data  file  into  a  RenderMan  sphere  using  two  shaders. 


FIGURE  23.  Simple  XML  for  a  sphere  with  a  Surface  and  Displace¬ 
ment  shader. 


Although  simple,  the  render  is  visually  rich.  This  descrip¬ 
tion  is  often  enough  to  visualize  most  generic  data,  but  it  cannot 
handle  all  arbitrary  visualizations,  so  in  order  to  maintain  gener¬ 
ality  we  make  use  of  Python  scripts  and  wrappers  to  enable  sim¬ 
plified  procedural  RenderMan  Interface  Bytestream  generation. 
Any  XML  element  can  be  scripted  such  that  at  runtime,  the  script 
output  will  be  piped  into  the  same  rendering  context.  This  makes 
it  possible  to  perform  processing  for  specialized  data  as  well  as 
modularize  the  rendering  of  specific  effects.  Obviously  this  adds 
more  complexity  for  defining  the  scene,  but  Chrono:: Render  pro¬ 
vides  Python  modules  with  methods  and  classes  intended  to  ease 
this  programming  as  much  as  possible.  Additionally,  most  of  the 
Chrono:: Render  Python  modules  wrap  C++  functions  and  classes 
with  the  purpose  of  exploiting  speed  while  still  making  use  of  the 
syntactical/type-free  simplicity  of  Python.  Figure  24  gives  an  ex¬ 
ample  of  combining  XML  with  Python  scripts  to  achieve  a  more 
complicated  render. 

Other  Capabilities 

Beyond  interpreting  parameters  and  data  into  RenderMan 
calls,  Chrono:: Render  provides  tools  for  bootstrapping  rendering 
projects.  Chrono:: Render  can:  (a)  construct  directory  structures 
for  localizing  and  managing  scene  resources;  (b)  automate  dis¬ 
tribution  of  rendering  across  a  multi-node  network;  (c)  convert 
common  graphics  file  formats  into  RenderMan  file  formats  such 
as  Wavefront  Objs  and  Mtls  to  RenderMan  RIBs  and  Shaders; 
(d)  generate  XML  for  automatically  adding  parameters  to  the 
scene  description  for  describing  advanced  visual  effects  such  as 
subsurface  scattering,  ambient  occlusion,  reflections,  etc.;  (e) 
mesh  point-clouds,  particularly  useful  for  particle-based  fluid 
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XML 

<renderobject> 

<string  name- name"  value="rover7> 

<vector  name="range"  value="0  1007> 

<string  name-'renderscript"  value- ’Rover.py7> 
</renderobject> 

<renderobject> 

<string  name-'name"  value="rocks_brown7> 
<vector  name=”range”  value=”101  500007> 
<geometry  name=”sphere7> 

<vector  name=”color”  value-  '0.82  0.71  0.557> 
<shader  type="Surface"  name=”plastic7> 
</renderobject> 

<renderobject> 

<string  name- name"  value-”rocks_grey7> 
<vector  name="range"  value="50001  1000007> 
<geometry  name=”sphere7> 

<vector  name=”color”  value=”0.772  0.756  0.667> 
<shader  type="Surface”  name=”plastic7> 
</renderobject> 

<renderpass> 

<string  name-name”  value=”amb_occlusion7> 
<string  name="renderscript”  value=”AO.py7> 
</renderpass> 


Rover. py 

import  crender 

data  =  cman.dataobj("rover") 

for  robj  in  data: 

geotype  =  robj[“typeT  #  find  user-tagged  data 
if  geotype  ==  0: 

#Render  Cylinder  BIG 
RiCylinder(l) 
if  geotype  ==  1 : 

#Render  Cylinder  small 
if  geotype  ==  2: 

#Render  Rover  Geometry 
RiArchivefrover.rib") 


FIGURE  24.  General  purpose  rendering  with  Chrono:: Render.  The 
Rover  body  contains  multiple  shape  descriptions  of  which  are  generated 
from  a  Python  script.  Data  is  tagged  with  a  name  which  can  be  later  be 
accessed  using  some  of  Chrono:  :Render’s  Python  functionality. 


ChronoAvailability 

Major  releases  of  the  Chrono:: Engine  software  are 
available  from  the  Chrono:: Engine  website  at  http:// 
chronoengine  .info.  Chronoin  its  entirety  can  be  down¬ 
loaded  from  http  :  //sbel .  wise  .  edu/ chrono.  The  latter 
site  also  displays  the  nightly  build  status  for  various  platforms 
and  unit  testing  results. 
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simulations;  and  (/)  dump  the  generated  RenderMan  calls  to 
disk  for  reuse. 

Chrono :Render  is  currently  available  for  free  download  as  a 
pre-built  binary  for  Linux.  Members  of  the  Wisconsin  Applied 
Computing  Center  can  use  this  capability  remotely  as  a  service 
by  leveraging  320  AMD  cores  on  which  Chrono:: Render  is  cur¬ 
rently  deployed. 


Conclusions  and  Future  Work 

The  Chronosimulation  package  is  composed  of  a  collec¬ 
tion  of  components  designed  to  perform  multi-physics  simula¬ 
tions  leveraging  emerging  high-performance  computing  hard¬ 
ware.  Chrono:: Engine  provides  support  for  rigid  body  dynamics, 
focusing  on  large  granular  dynamics  problems,  Chrono: :Flex  en¬ 
ables  simulation  of  flexible  beam  and  plate  elements  interacting 
through  contact  and  bilateral  constraints,  while  Chrono: :Fluid 
allows  the  simulation  of  fluid  flows  and  fluid- solid  interaction 
problems.  Finally,  Chrono:: Render  provides  high-quality  visu¬ 
alization  of  arbitrary  simulation  data  from  the  other  Chrono- 
components.  These  components  have  been  designed  to  lever¬ 
age  high-performance  computing  hardware  whenever  possible. 
Chrono:: Engine  supports  CPU  parallelism  through  a  domain- 
decomposition  approach,  while  Chrono:: Engine,  Chrono: :Flex, 
and  Chrono: :Fluid  all  support  GPU  parallelism  to  further  im¬ 
prove  simulation  performance. 

While  these  components  provide  useful  simulation  capabil¬ 
ities  on  their  own,  ongoing  work  seeks  to  further  integrate  the 
various  Chronocomponents. 
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